require(mosaic)
require(mosaicData)
require(lmtest)
cols = trellis.par.get()$superpose.symbol$col
Often, we want to model a response variable that is binary, meaning that it can take on only two possible outcomes. These outcomes could be labeled “Yes” or “No”, or “True” or “False”, but for all intents and purposes, they can be coded as either 0 or 1. We have seen these types of variables before (as indicator variables), but we always used them as explanatory variables. Creating a model for such a variable as the response requires a more sophisticated technique than ordinary least squares regression. It requires the use of a logistic model.
The data in the Whickham data set (built into mosaicData) contains observations about women, and whether they were alive 20 years after their initial observation. Our goal is to determine how being a smoker affects the probability of being alive, after controlling for age.
data(Whickham)
head(Whickham)
## outcome smoker age
## 1 Alive Yes 23
## 2 Alive Yes 18
## 3 Dead Yes 71
## 4 Alive No 67
## 5 Alive No 64
## 6 Alive Yes 38
First, let’s plot the data. Already we meet challenges.
xyplot(outcome ~ age, groups=smoker, data=Whickham, alpha=0.5, pch=19, cex=2, auto.key = TRUE)
It is very difficult to tell what, if anything, is happening here. Let’s add a new variable that is the numeric equivalent of the outcome variable.
Whickham = Whickham %>%
  mutate(isAlive = 2 - as.numeric(outcome))
Once we have a 0/1 vector for our response, we can try plotting again. This plot makes it a little easier to see what is going on by jittering the points in the \(y\)-direction (something we couldn’t have done without numeric data).
myplot = xyplot(jitter(isAlive) ~ age, groups=smoker, data=Whickham, alpha=0.5, pch=19, cex=2, ylab="isAlive")
print(myplot)
The simplest approach to modeling this would just be to predict the average for every observation.
avg = mean(~isAlive, data=Whickham)
avg # what does this number mean?
## [1] 0.7191781
ladd(panel.abline(h=avg, col=cols[1], lwd=3))
Certainly we could improve on this with a linear model, but is it appropriate here? Let’s try one.
m1 = lm(isAlive ~ age + smoker, data=Whickham)
summary(m1)
##
## Call:
## lm(formula = isAlive ~ age + smoker, data = Whickham)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18176 -0.19223 0.01779 0.26012 0.72295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4725543 0.0301019 48.919 <2e-16 ***
## age -0.0161555 0.0005581 -28.949 <2e-16 ***
## smokerYes 0.0104743 0.0195768 0.535 0.593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3502 on 1311 degrees of freedom
## Multiple R-squared: 0.3942, Adjusted R-squared: 0.3933
## F-statistic: 426.5 on 2 and 1311 DF, p-value: < 2.2e-16
fit.alive = makeFun(m1)
plotFun(fit.alive(age=x, smoker="Yes") ~ x, plot=myplot, lwd=3, add=TRUE)
plotFun(fit.alive(age=x, smoker="No") ~ x, col=cols[2], lwd=3, add=TRUE)
So, we do not think that a linear model is the way to go. The predicted values aren’t interpretable (what does 0.65 isAlive mean?) and they go outside the realm of possibility (what does 1.24 isAlive mean?).
So, instead of modeling \(\pi\) (the response variable) like this, \[ \pi = \beta_0 + \beta_1 X \]
suppose we modeled it like this, \[ \log \left( \frac{\pi}{1-\pi} \right) = logit(\pi) = \beta_0 + \beta_1 X \]
This transformation is called the logit function. What are the properties of this function? Note that this implies that \[ \pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \in (0,1) \]
The logit function constrains the fitted values to lie within \((0,1)\), which gives them a natural interpretation as the probability of the response actually being 1.
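To see these properties concretely, here is a quick numerical sketch. The two helper functions are defined by hand so the example is self-contained (mosaic ships similar helpers, which we use later).

```r
logit <- function(p) log(p / (1 - p))          # maps (0,1) to (-Inf, Inf)
invlogit <- function(z) exp(z) / (1 + exp(z))  # maps (-Inf, Inf) back into (0,1)

# No matter how extreme the linear predictor, the inverse logit stays in (0,1)
z <- c(-10, -5, 0, 5, 10)
invlogit(z)   # all values strictly between 0 and 1

# The two functions undo each other on (0,1)
p <- c(0.1, 0.5, 0.9)
invlogit(logit(p))
```

This is why fitted probabilities from a logistic model can never escape the unit interval, unlike the linear model above.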
Fitting a logistic curve is mathematically more complicated than fitting a least squares regression, but the syntax in R is similar, as is the output. The procedure for fitting is called maximum likelihood estimation, and the usual machinery for the sum of squares breaks down. Consequently, there is no notion of \(R^2\), etc.
logm = glm(isAlive ~ age + smoker, data=Whickham, family=binomial)
summary(logm)
##
## Call:
## glm(formula = isAlive ~ age + smoker, family = binomial, data = Whickham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2795 -0.4381 0.2228 0.5458 1.9581
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.599221 0.441231 17.223 <2e-16 ***
## age -0.123683 0.007177 -17.233 <2e-16 ***
## smokerYes -0.204699 0.168422 -1.215 0.224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1560.32 on 1313 degrees of freedom
## Residual deviance: 945.02 on 1311 degrees of freedom
## AIC: 951.02
##
## Number of Fisher Scoring iterations: 6
The procedure for adding the logistic curve to the plot is the same as it was before.
fit.outcome = makeFun(logm)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, lwd=3, plot=myplot, add=TRUE)
plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], lwd=3, add=TRUE)
How does the smoker term affect the model?
We can think of logistic regression models in 3 different “spaces.” For this example, let’s just think about one predictor.
logm2 = glm(isAlive ~ age, data=Whickham, family=binomial(logit))
The first space is the linear model space, where we are considering log-odds.
xyplot(log(fitted.values(logm2)/(1-fitted.values(logm2))) ~ age,
       data=Whickham, type=c("l"), ylab="log odds")
The second is the odds space,
xyplot(fitted.values(logm2)/(1-fitted.values(logm2))~age, data=Whickham,
       type="spline", ylab="odds")
Another is the probability space,
plotModel(logm2, ylab="probability")Another way to make sense of a binary response variable is to bin the explanatory variable and then compute the average proportion of the response within each bin.
Whickham = Whickham %>%
mutate(ageGroup = cut(age, breaks=10))
favstats(~isAlive | ageGroup, data=Whickham)
## ageGroup min Q1 median Q3 max mean sd n missing
## 1 (17.9,24.6] 0 1 1 1 1 0.9763780 0.1524700 127 0
## 2 (24.6,31.2] 0 1 1 1 1 0.9790576 0.1435679 191 0
## 3 (31.2,37.8] 0 1 1 1 1 0.9567901 0.2039597 162 0
## 4 (37.8,44.4] 0 1 1 1 1 0.9097222 0.2875796 144 0
## 5 (44.4,51] 0 1 1 1 1 0.8013245 0.4003310 151 0
## 6 (51,57.6] 0 0 1 1 1 0.7109375 0.4551083 128 0
## 7 (57.6,64.2] 0 0 1 1 1 0.6071429 0.4898456 168 0
## 8 (64.2,70.8] 0 0 0 0 1 0.2282609 0.4220114 92 0
## 9 (70.8,77.4] 0 0 0 0 1 0.1382979 0.3470634 94 0
## 10 (77.4,84.1] 0 0 0 0 0 0.0000000 0.0000000 57 0
Although this is not the preferred method for performing logistic regression, it can be illustrative to see how the logistic curve fits through this series of points.
# print(myplot)
binned.y = mean(~isAlive | ageGroup, data=Whickham)
binned.x = mean(~age | ageGroup, data=Whickham)
binplot = xyplot(binned.y ~ binned.x, cex=2, pch=19, col="orange", type=c("p", "r"), lwd=3)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, lwd=3, add=TRUE, plot=binplot)
plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], lwd=3, add=TRUE)
Consider now the difference between the fitted values and the link values. Although the fitted values do not follow a linear pattern with respect to the explanatory variable, the link values do. To see this, let’s plot them explicitly against the logit of the binned values.
xyplot(logit(binned.y) ~ binned.x, pch=19, cex=2, col="orange", type=c("p", "l"))
## Warning in logit(binned.y): proportions remapped to (0.025, 0.975)
Whickham = Whickham %>%
mutate(logm.link = predict(logm, type="link"))
ladd(with(subset(Whickham, smoker=="Yes"), panel.xyplot(age, logm.link, col=cols[1], type="l")))
ladd(with(subset(Whickham, smoker=="No"), panel.xyplot(age, logm.link, col=cols[2], type="l")))
Note how it is considerably easier for us to assess the quality of the fit visually using the link values, as opposed to the binned probabilities.
As an exercise, try fitting a logistic model for acceptance as a function of GPA, with Gender as a covariate, using the MedGPA data set.
require(Stat2Data)
data(MedGPA)
The interpretation of the coefficients in a linear regression model is clear based on an understanding of the geometry of the regression model. We use the terms intercept and slope because a simple linear regression model is a line. In a simple logistic model, the line is transformed by the logit function. How do the coefficients affect the shape of the curve in a logistic model?
The following manipulate function will allow you to experiment with changes to the intercept and slope coefficients in the simple logistic model for isAlive as a function of age.
log.whickham = function (intercept.offset = 0, slope.multiple=1,...) {
# data(Whickham)
Whickham = Whickham %>%
mutate(isAlive = 2 - as.numeric(outcome))
logm = glm(isAlive ~ age, data=Whickham, family=binomial(logit))
fit.outcome = makeFun(logm)
xyplot(jitter(isAlive) ~ age, groups=smoker, data=Whickham,
ylab="isAlive",
panel = function (x,y,...) {
panel.xyplot(x,y, alpha=0.3, pch=19, cex=2,...)
panel.curve(fit.outcome(x), col="darkgray", lty=3, lwd=3)
panel.curve(fit.outcome(x * slope.multiple + intercept.offset), lwd=3)
}
)
}
require(manipulate)
manipulate(log.whickham(intercept.offset, slope.multiple),
intercept.offset = slider(-20, 20, step=1, initial=0),
slope.multiple = slider(0, 5, step=0.25, initial=1)
)
For another way to think about the data, see this shiny app: http://45.55.32.181/shiny/log_app/
We saw earlier that the link values are linear with respect to the explanatory variable. The link values are the \(\log\) of the odds. Note that if an event occurs with probability \(\pi\), then \[ odds = \frac{\pi}{1-\pi}, \qquad \pi = \frac{odds}{1+odds}. \] Note that while \(\pi \in [0,1]\), \(odds \in (0,\infty)\). Thus, we can interpret \(\hat{\beta_1}\) as the typical change in \(\log{(odds)}\) for each one unit increase in the explanatory variable. More naturally, the odds of success are multiplied by \(e^{\hat{\beta_1}}\) for each one unit increase in the explanatory variable, since this is the odds ratio. \[ \begin{aligned} odds_X &= \frac{\hat{\pi}_X}{1 - \hat{\pi}_X} = e^{\hat{\beta}_0 + \hat{\beta}_1 X} \\ odds_{X+1} &= \frac{\hat{\pi}_{X+1}}{1 - \hat{\pi}_{X+1}} = e^{\hat{\beta}_0 + \hat{\beta}_1 (X + 1)} \\ \frac{odds_{X+1}}{odds_X} &= \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 (X + 1)}}{e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = e^{\hat{\beta}_1} \end{aligned} \] Furthermore, since the logits are linear with respect to the explanatory variable, these changes are constant.
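As a sketch, we can verify this constant-ratio property numerically using the coefficient estimates printed in the summary of logm above (the values are copied by hand from that output, so treat them as approximate, and we hold smoker at "No" so only the intercept and age terms matter):

```r
b0 <- 7.599221    # intercept estimate from summary(logm)
b1 <- -0.123683   # age coefficient from summary(logm)
invlogit <- function(z) exp(z) / (1 + exp(z))

# odds of being alive at a given age, for a non-smoker
odds <- function(age) {
  p <- invlogit(b0 + b1 * age)
  p / (1 - p)
}

odds(41) / odds(40)   # ratio of odds for one extra year of age
exp(b1)               # the same value: the odds ratio e^{beta1}
```

The ratio is the same no matter which consecutive pair of ages we pick, which is exactly what the algebra above says.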
Finding confidence intervals for the odds ratio is easy. We could look at the point estimate and standard error from the summary() and use the qnorm() function to find a critical z value, or we could just use the convenience function confint().
exp(confint(logm))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 870.1646058 4916.0147645
## age 0.8708442 0.8957213
## smokerYes 0.5845122 1.1318806
Three of the conditions we require for linear regression have analogs for logistic models: linearity (the log-odds should be a linear function of the explanatory variables), randomness (the data should arise from a random process), and independence (the observations should be independent of one another).
However, the requirements for Constant Variance and Normality are no longer applicable. In the first case, the variability in the response now inherently depends on the fitted value, so we know we won’t have constant variance. In the second case, there is no reason to think that the residuals will be normally distributed, since the “residuals” can only be computed in relation to 0 or 1. So in both cases the properties of a binary response variable break down the assumptions we made previously.
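A one-line check of the first point: a 0/1 response with success probability \(p\) has variance \(p(1-p)\), which necessarily changes with the fitted value.

```r
# Variance of a Bernoulli response at several success probabilities
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
p * (1 - p)   # 0.09 0.21 0.25 0.21 0.09 -- largest at p = 0.5
```

So observations whose fitted probability is near 0.5 are inherently noisier than those near 0 or 1, and no transformation of a binary response can make the variance constant.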
Moreover, since we don’t have any sums of squares, we can’t use \(R^2\), ANOVA, or \(F\)-tests. Instead, since we fit the model using maximum likelihood estimation, we compute the likelihood of our model. \[ L(y_i) = \begin{cases} \hat{\pi} & \text{if } y_i=1 \\ 1-\hat{\pi} & \text{if } y_i=0 \end{cases},\qquad L(model) = \prod_{i=1}^n L(y_i) \] Because these numbers are usually very small (why?), it is more convenient to speak of the log-likelihood \(\log(L)\), which is always negative. A larger \(\log(L)\) is closer to zero and therefore a better fit.
The log-likelihood is easy to retrieve
logLik(logm)
## 'log Lik.' -472.5098 (df=3)
but is nearly as easy to calculate directly.
pi = fitted.values(logm)
likelihood = ifelse(Whickham$isAlive == 1, pi, 1 - pi)
log(prod(likelihood))
## [1] -472.5098
The closest thing to an analog of the \(F\)-test is the Likelihood Ratio Test (LRT). Here, our goal is to compare the log-likelihoods of two models: the one we build vs. the constant model. This is similar to the way we compared the sum of the squares explained by a linear regression model to the model that consists solely of the grand mean.
The null hypothesis in the LRT is that \(\beta_1 = \beta_2 = \cdots = \beta_k = 0\). The alternative hypothesis is that at least one of these coefficients is non-zero. The test statistic is: \[ G = -2\log L(\text{constant model}) - \left(-2 \log L(\text{model})\right). \] These two quantities are known as deviances. It can be shown that \(G\) follows a \(\chi^2\) distribution with \(k\) degrees of freedom. This allows us to compute a \(p\)-value for the model.
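As a sketch, we can recover \(G\) and its \(p\)-value by hand from the deviances printed in the summary of logm above (values copied from that output; \(k = 2\) predictors, age and smoker):

```r
# G is the null deviance minus the residual deviance, from summary(logm)
G <- 1560.32 - 945.02
G                                        # 615.3

# Compare G to a chi-squared distribution with k = 2 degrees of freedom
pchisq(G, df = 2, lower.tail = FALSE)    # effectively zero
```

This matches the Chisq statistic and \(p\)-value reported by lrtest() below.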
lrtest(logm)
## Likelihood ratio test
##
## Model 1: isAlive ~ age + smoker
## Model 2: isAlive ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -472.51
## 2 1 -780.16 -2 615.3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this sense the LRT has obvious similarities to the ANOVA table and \(F\)-test. In the same way that we previously performed a nested \(F\)-test to assess the usefulness of a group of predictors, we can perform a nested LRT.
Adding interaction or quadratic terms works in much the same way as it did with linear regression.
linteract = glm(isAlive ~ age + smoker + age*smoker, data=Whickham, family=binomial)
summary(linteract)
##
## Call:
## glm(formula = isAlive ~ age + smoker + age * smoker, family = binomial,
## data = Whickham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3983 -0.4256 0.2163 0.5598 1.9283
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.169231 0.606600 13.467 <2e-16 ***
## age -0.133231 0.009953 -13.386 <2e-16 ***
## smokerYes -1.457843 0.837232 -1.741 0.0816 .
## age:smokerYes 0.022235 0.014495 1.534 0.1250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1560.32 on 1313 degrees of freedom
## Residual deviance: 942.68 on 1310 degrees of freedom
## AIC: 950.68
##
## Number of Fisher Scoring iterations: 6
Suppose now that we suspect that there are diminishing returns to the extent to which being alive is associated with a person’s age. We can easily add quadratic terms.
lquad = glm(isAlive ~ age + smoker + age*smoker + I(age^2) + I(age^2):smoker, data=Whickham, family=binomial)
summary(lquad)
##
## Call:
## glm(formula = isAlive ~ age + smoker + age * smoker + I(age^2) +
## I(age^2):smoker, family = binomial, data = Whickham)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7524 -0.2808 0.2680 0.5172 2.1367
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.9879917 1.4181591 2.107 0.035122 *
## age 0.0737044 0.0571073 1.291 0.196832
## smokerYes 1.5039169 2.1311556 0.706 0.480386
## I(age^2) -0.0019389 0.0005542 -3.499 0.000467 ***
## age:smokerYes -0.0915658 0.0866460 -1.057 0.290611
## smokerYes:I(age^2) 0.0010145 0.0008558 1.185 0.235879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1560.32 on 1313 degrees of freedom
## Residual deviance: 928.94 on 1308 degrees of freedom
## AIC: 940.94
##
## Number of Fisher Scoring iterations: 6
How can we assess whether these terms are warranted? Just like the nested \(F\)-test, the nested LRT gives us information about the incremental contribution of a set of terms to our model.
lrtest(logm, linteract, lquad)
## Likelihood ratio test
##
## Model 1: isAlive ~ age + smoker
## Model 2: isAlive ~ age + smoker + age * smoker
## Model 3: isAlive ~ age + smoker + age * smoker + I(age^2) + I(age^2):smoker
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -472.51
## 2 4 -471.34 1 2.3391 0.126163
## 3 6 -464.47 2 13.7443 0.001036 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(myplot)
fit.qalive = makeFun(lquad)
plotFun(fit.outcome(age=x, smoker="Yes") ~ x, add=TRUE, lty=2)
plotFun(fit.outcome(age=x, smoker="No") ~ x, col=cols[2], add=TRUE, lty=2)
plotFun(fit.qalive(age=x, smoker="Yes") ~ x, add=TRUE)
plotFun(fit.qalive(age=x, smoker="No") ~ x, col=cols[2], add=TRUE)